Sentiment Analysis on Amazon Fine Food Reviews:

Introduction:

In this project, I performed a sentiment analysis of more than 500,000 reviews from the Amazon Fine Food Reviews dataset. I labeled the reviews as positive or negative, then built word clouds, plotly visualizations, and a text classification model to explore the results further.

Data:

For this project, I used the Amazon Fine Food Review dataset found on Kaggle.

Methodology:

To prepare for this analysis, I visualized the product scores from the dataset in a histogram using the plotly library.

The blue histogram shows more positive customer ratings than negative ones, so the majority of the reviews in this dataset are positive.
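The score distribution behind a histogram like this can be sketched as follows. This is a minimal illustration rather than the original notebook code: the function names `score_counts` and `plot_score_histogram` are hypothetical, the toy list stands in for the dataset's 'Score' column, and the title and axis labels are assumptions.

```python
from collections import Counter


def score_counts(scores):
    """Count how many reviews received each star rating (1 to 5)."""
    counts = Counter(scores)
    return {star: counts.get(star, 0) for star in range(1, 6)}


def plot_score_histogram(scores):
    """Draw the rating distribution with plotly (imported only when plotting)."""
    import plotly.express as px  # third-party package

    fig = px.histogram(x=list(scores), title="Product Score")
    fig.update_traces(marker_color="blue")
    fig.update_layout(xaxis_title="Score", yaxis_title="Count", bargap=0.1)
    return fig


# Toy ratings standing in for the 'Score' column (not real data).
toy_scores = [5, 5, 4, 1, 3, 5, 2, 4, 5]
counts = score_counts(toy_scores)
print(counts)
```

Calling `plot_score_histogram(toy_scores).show()` would render the chart in a notebook or browser.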

Next, I created a word cloud to show the most frequently used words in the 'Text' (review) column. Before generating it, I checked for null values and removed common stopwords using NLTK's stopword list.

The null check shows that the 'Text' column doesn't have any null values.
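A null check of this kind might look like the sketch below. It assumes the reviews are held in a pandas DataFrame; the toy frame is made up and only mirrors the 'Score' and 'Text' column names mentioned in this write-up.

```python
import pandas as pd

# Toy frame standing in for the review data (not real reviews).
df = pd.DataFrame(
    {
        "Score": [5, 1, 3],
        "Text": ["Great coffee", "Stale and bland", "It was okay"],
    }
)

# Count missing values per column; 0 for 'Text' means no review body is null.
missing_per_column = df.isnull().sum()
print(missing_per_column)

text_missing = int(missing_per_column["Text"])
```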

Review Word Cloud
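The word cloud step can be approximated as below. This is a sketch under stated assumptions, not the project's code: `word_frequencies` and `build_word_cloud` are hypothetical helper names, the small `STOPWORDS` set is a stand-in for NLTK's English stopword list, and the image size and background color are arbitrary choices.

```python
import re
from collections import Counter

# Tiny stand-in stopword set (assumption). NLTK's fuller list is available via
# nltk.corpus.stopwords.words("english") after nltk.download("stopwords").
STOPWORDS = {"the", "a", "an", "and", "is", "it", "this", "of", "to", "was", "i", "my"}


def word_frequencies(texts, stopwords=STOPWORDS):
    """Count non-stopword tokens across all review texts."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts


def build_word_cloud(frequencies):
    """Render a word cloud from word counts (wordcloud imported only when used)."""
    from wordcloud import WordCloud  # third-party package

    cloud = WordCloud(width=800, height=400, background_color="white")
    return cloud.generate_from_frequencies(frequencies)


# Toy reviews standing in for the 'Text' column (not real data).
toy_reviews = [
    "This coffee is great",
    "Great taste and great price",
    "The tea was bland",
]
freqs = word_frequencies(toy_reviews)
print(freqs.most_common(3))
```

The object returned by `build_word_cloud(freqs)` can be displayed with matplotlib's `imshow` or saved with its `to_file` method.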

Next, I added a sentiment column derived from the dataset's 'Score' column. Reviews with a score above 3 are labeled positive, reviews with a score below 3 are labeled negative, and neutral reviews (score equal to 3) are dropped. The sentiment column is later used as the training labels for the sentiment classification model.
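The labeling rule above can be sketched with pandas as follows. The toy rows are made up, and the string labels "positive" and "negative" are an assumption; the original notebook may have encoded the classes differently (for example as +1 and -1).

```python
import pandas as pd

# Toy frame standing in for the review data (not real reviews).
df = pd.DataFrame(
    {
        "Score": [5, 3, 1, 4, 2],
        "Text": ["Loved it", "It was okay", "Awful", "Pretty good", "Too salty"],
    }
)

# Drop neutral reviews (score of exactly 3), then label the rest.
df = df[df["Score"] != 3].copy()
df["sentiment"] = df["Score"].apply(lambda s: "positive" if s > 3 else "negative")

# Class balance; this is the distribution a sentiment histogram would show.
print(df["sentiment"].value_counts())
```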

After building the sentiment column, I created separate word clouds to display the most frequently used words in positive and negative product reviews, respectively. In addition, I made a product sentiment histogram to show how the reviews are distributed between the two sentiment classes.

Product Sentiment Histogram

The orange histogram shows that positive reviews outnumber negative ones.

Finally, I built a text classification model and measured its accuracy on the labeled reviews. I started by pre-processing the text with NLTK: lowercasing it, removing special characters, and removing stopwords. I then trained a Multinomial Naive Bayes classifier from the scikit-learn library and evaluated its accuracy.

Data Pre-Processing

The resulting classifier reaches an overall accuracy of approximately 90.5%.
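The pre-processing and classification steps described above can be sketched as follows. Several details are assumptions, since the write-up does not state them: the bag-of-words features from `CountVectorizer`, the 75/25 train/test split, the hypothetical helper `clean_text`, and the small `STOPWORDS` set standing in for NLTK's list. The toy reviews are made up, so the accuracy printed here says nothing about the 90.5% figure reported for the full dataset.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Stand-in stopword set (assumption); NLTK provides a fuller English list.
STOPWORDS = {"the", "a", "an", "and", "is", "it", "this", "of", "to", "was", "i"}


def clean_text(text):
    """Lowercase, strip non-letter characters, and drop stopwords."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(w for w in text.split() if w not in STOPWORDS)


# Toy labeled reviews (not real data).
texts = [
    "Loved this coffee, great flavor!",
    "Great snack, tasty and fresh.",
    "Delicious tea, will buy again.",
    "Excellent quality and great price.",
    "Awful taste, stale and bland.",
    "Terrible product, arrived broken.",
    "Bland and stale, very disappointing.",
    "Horrible flavor, waste of money.",
]
labels = ["positive"] * 4 + ["negative"] * 4

cleaned = [clean_text(t) for t in texts]
X_train, X_test, y_train, y_test = train_test_split(
    cleaned, labels, test_size=0.25, random_state=0, stratify=labels
)

# Turn text into word-count vectors, then fit Multinomial Naive Bayes.
vectorizer = CountVectorizer()
model = MultinomialNB()
model.fit(vectorizer.fit_transform(X_train), y_train)

predictions = model.predict(vectorizer.transform(X_test))
accuracy = accuracy_score(y_test, predictions)
print(f"accuracy on toy data: {accuracy:.3f}")
```

The vectorizer is fitted on the training split only, so the held-out reviews do not influence the vocabulary used by the model.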